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ABSTRACT 

We explore different estimators of the local non- linear coupling parameter, /nl, based 
on the binned bispectrum presented in Bucher et al. Using simulations of Wilkinson 
Microwave Anisotropy Probe (WMAP)-7yr data, we compare the performance of a 
regression neural network with a ^^-minimization and study the dependence of the 
results on the presence of the linear term in the analysis and on the use of inpainting 
for masked regions. Both methods obtain similar results and are robust to the use of 
inpainting, but the neural network estimator converges considerably faster. We also 
examine the performance of a simplified estimator that assumes a diagonal matrix 
and has the linear term subtracted, which considerably reduces the computational 
time; in this case inpainting is found to be crucial. The estimators are also applied to 
real WMAP-7yr data, yielding constraints at 95% confidence level of —3 < /nl < 83. 



1 INTRODUCTION 



Cosmic microwave background (CMB) fluctuations natu- 
rally arise in inflationary models. Discriminating between 
different models is a difficult task, but can be addressed by 
observing very faint non-Gaussian signals in the high-order 
correlation functions of the CMB temperature anisotropics. 
A popular approach is to search for the local form of non- 
Gaussianity, where the initial curvature Gaussian perturba- 
tions are expanded up to the second order as 

$ = + /nl [<fl - ($^)] 



(for m ore details see e.g. iBartolo et al. 1 12004 iBabich et al.l 
|2004| ). 

WMAP constraints on the amplitude of the local form 
of non-Gaussianity have b een able to rule out exotic mod- 
els such as ghost inflation (|Arkani-Hamed et al.|[20o3 ). New 
data sets, such as the rece nt release from Planck satellite 
l|Planck Collaboration|[2013l ). significantly reduce the uncer- 
tainties on local /nl, ruling out the ekpyrotic model and 
imposing strong constraints on multi-field inflationary mod- 
els. In fact, for single-field infiation, /nl (hereafter /nl is 
the l ocal form) s hould be of the order of the spectral in- 
dex (|Creminelli fc Zaldarriaga 2004), given the consistency 
relation derived in iMaldacenal ^200^ . Recent papers show 
that this relation does not hold for non-vacuum initial states 
l|Gandl201ll : lAgullo fc Parkeill201ll ) and non-constant super- 
horizon modes (I Chen et al.l 20131 ) . but the vast majority of 
single-field models should be ruled out by a detection of a 
larger /nl value. 

This type of primordial non-Gaussianity may be de- 
tected using higher-order correlation functions. The simplest 



of these is third-order, which is equivalent to the bispec- 
trum in spherical harmonic space. The first derivation of 
the optimal estimator, in the sense of an unbiased es timator 
that s aturates the Cramer-Rao inequality, is given in lBabiciil 
l|2005h . assuming an isotropic field. Working with real data, 
however, is usually more complicated. In particular, CMB 
maps have anisotropic noise due to the scanning strategy 
and masked regions, both of which break the isotropy as- 
sumption for these theoretical estimators. The masked re- 
gions are particularly difficult to treat, as they introduce 
correlations among the Fourier mo des, which are otherwise 
expected to be independent. iCreminelli et al. applied 
the optimal estimator to real data, showing that the pres- 
ence of a term proportional to the atm is required to ac- 
count for such anisotropics. In that paper the constraints 
are computed using an approximation to avoid numerical 
difficulties. Finally, this estimator was s uccessfully applied 
in its complete form to WMAP data bv (ISmith et al.ll2009l : 
iKomatsu et al.ll201ll : iBennett et"al]|2012l . for 5th. 7th and 
9th year respectively). 

New imaging reconstruction techniques have recently 
been used to pre-process CMB maps by smoothing the con- 
tours of the masked regions. A simple approach is to apodize 
the mask by introducing a smooth function in the pixels 
surrounding the masked regions. Another approach is to fill 
the masked regions with a pseudo-signal, which is termed 
inpainting. Several techniques have been proposed in the lit- 
erature for inpainting, whi ch is a very delicate process since 
the signal can be d istorted (|Baikovall2005l : lAbrial et al ]|2008l : 
IStarck et ai]|2013l ). 

Consequently, primordial non-Gaussianity analyses can 
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be computationally demanding, and new techniques should 
therefore be investigated to overcome the computational 
cost of large matrix estimations and inversions. Here we in- 
vestigate the utility of a neural network to obtain the neces- 
sary weights in the /nl estimator and compare it with the 
direct approach via minimization. Over the last 20 years, 
artificial intelligence techniques have been use in a number 
of areas of astrophysical analysis: morphological galaxy de- 
termination, photo-redshift estimations, and classification of 
different objects a re examples of successful applications of 
neural networks f^irth et al.ll20ol IVanzella fc et al.ll2004 
ICarballo et al.ll2008l ). In particular, for cosmological analy- 
sis, they have recently been used to reduce the computa- 
tional time of cosmological parameter est imation from ob- 
serva tions of the CMB powe r spectrum jAuk et al. 



I2OO8I ). Also in CMB analvsis. ICasaponsa et al 



I200I 



( 20111 ) used 



neural networks to define a new non-Gaussianity estimator 
and showed that networks are a valuable tool for bypass- 
ing the inversion of ill-conditioned matrices, and to avoid 
covariance matrix estimation in a analysis. 

The aim of the present work is to continue our earlier 
study of the power of the neural networks in the statisti- 
cal analysis associated with cosmic microwave background 
(CMB) non-Gaussianity. To this end, this paper is focused 
on the study of different tools, in order to identify the most 
robust and efficient estimator when dealing with real data. 
We compare three different approaches to estimate /nl, 
based on the binned bispectrum. The first estimator is ob- 
tained by minimizing a of the binned bispectrum com- 
ponents. A second approach is based on the optimal estima- 
tor, without taking into account the correlations among the 
binned bispectrum components, which for a isotropic field 
would be the same as the former. And the third method 
uses the weights of a regression neural network. From these 
approaches we construct different estimators to account for 
the effects of pre-processing the data with inpainting and 
the presence of a the linear term. 

The paper is organized as follows. An overview of the 
type of neural network employed and the training procedure 
is given in Section [51 In Section we describe the binned 
bispectrum. The definition of the estimators is presented in 
Section 0] followed by an explanation of the main details of 
the implementation in Section [S] The results are presented 
in section [6l and finally the conclusions are summarised in 
section [T] 



2 NEURAL NETWORKS 

Artificial neural networks (ANN) are a methodology for 
computing, based on massive parallelism and redundancy, 
which are features also found in animal brains. They con- 
sist of a number of interconnected nodes each of which pro- 
cesses information and passes it to other nodes in the net- 
work. Well-designed networks are able to 'learn' from a set of 
training data and to make predictions when presented with 
new, possibly incomplete, data. These algorithms have been 
successfully applied in several areas, in particula r, we note 
the followin g applica t ions i n cosmology: iBaccigalupi fc et al 
(2000); Fir th ei all (|2003|): [Bail et al.l ((2OO4I); lAuld etld 
( 2007 .. ,20081 ') : ICasaponsa et al.l (120111) and lN0rgaard-Nielsen 
( 2OI2I ). 



0^ 




Vk 

^0 


0^ 




^0 




T>0^ 





o 



o 



113 



o 



Figure 1. Schematic diagram of a 3-layer feed- forward neural 
network. 



The basic building block of an ANN is the neuron or 
node. Information is passed as inputs to the neuron, which 
processes them and produces an output. The output is typ- 
ically a simple mathematical function of the inputs. The 
power of the ANN comes from assembling many neurons 
into a network. The network is able to model very com- 
plex behaviour from input to output. We use a three-layer 
feed-forward network consisting of a layer of input neurons, 
a layer of 'hidden' neurons and a layer of output neurons. 
Figure [1] shows a schematic design of such a network. 

The outputs of the hidden layer and the output layer 
are related to their inputs as follows: 

hidden layer: h, = ^(^'(/f '); ' = ^ + g'^'^l) 

i 

output layer: y, = g^'^ (/f ' ) ; /f > = ^ h, + ^.^^ (2) 

j 

for each hidden node j and each output node k. The in- 
dex i runs over all input nodes. The functions g^^^ and g^^^ 
are called activation functions. The non-linear nature of 3^^' 
is a key ingredient in constructing a viable and practically 
useful network. This non-linear function must be bounded, 
smooth and monotonic; we use g^^\x) = tanhx. For g^-^^ 
we simply use g'^^^x) = x. The layout and number of nodes 
are collectively termed the architecture of the network. For 
a basic intro duction t o artifi cial neural network s the reader 
is directed to lMackavl (|2003l ') and lColdenl (| 19961 ). 

For a given architecture, the weights w and biases 9 
define the operation of the network and are the quantities 
we wish to determine by some training algorithm. Basically, 
the training process is an iterative algorithm that optimises 
a given objective function that quantifies the accuracy of 
the network outputs. We denote w and 9 collectively by 
the network parameters a. As these parameters vary dur- 
ing training, a very wide range of non-linear mappings be- 
tween inputs and outputs is possible. In fact, acco rding to 
a 'universal approximation theorem ' (|Leshnolll993 ). a stan- 
dard three-layer feed-forward network can approximate any 
continuous function to any degree of accuracy with appro- 
priately chosen activation functions and a sufficient number 
of hidden nodes. 

In our previous application of ANN to the estima- 
tion of /nl, a classification neural network was used 
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jCasaponsa et al"]|201l[ ). Here, we instead use a regression 
network, which we find to be as useful as the classification 
approach, and also allows a more direct comparison with the 
minimization procedure. Additionally, using a regression 
network has the advantage of reducing the network param- 
eter sp ace, making the trainin g faster. 

In lCasaponsa et al.l ()201ll ). we used neural networks for 
which the inputs were third-order moments of two wavelet 
decom positions of the CMB map: the Healpix wavelet 
(HW) (jCasaponsa et al.l[2011i^ and the spherical Mexican 
hat wavelet fSMHW) (|Curto et al.ll2009l . boill ). We found 
the resulting /nl estimator had the same accuracy as the 
standard one based on ^'^-minimization, but was much faster 
to evaluate. Here, the inputs to our neural networks are 
the es timator for the bispectrum proposed bv lBucher et al.l 
defined in a number of bins in /-space, which reduces 
the dimension of the problem by a factor of 10^. Our aim 
is to learn a mapping from the binned bispectrum compo- 
nents of the (possibly) non-Gaussian CMB (assembled into 
an input feature vector x) to the corresponding /nl of the 
map; this is discussed in more detail below. 

A suitable objective function for this problem is 



z:(a) 



(3) 



where the index n runs over the training data-set = 
{x^"^,t*"'}, in which the target vector t^"^ for the network 
outputs are the /nl values, as explained in the next sec- 
tion. One then wishes to find network parameters a that 
minimise this objective function as the training progresses. 
This is, however, a highly non-linear, multi-modal function 
in many dimensions whose optimisation poses a non-trivial 
problem. We perform this optimis ation using the Mem- 
Sys package (|Gu11 fc Skillind Il999l '). This algorithm con- 
siders the parameters a to have prior probabilities propor- 
tional to 6" ^^"^ where S(a.) is the pos itive- negative entropy 
functional (|Hobson fc Lasenbvl Il998l ). and q is a hyper- 
parameter of the prior that sets the scale on which variations 
in a are expected. The value of a is chosen to max;imise its 
marginal posterior probability, value of which is inversely 
proportional to the standard deviation of the prior. Thus 
for a given q, the log-posterior probability is proportional 
to £(a) -|-Q:S(a). For each chosen a there is a solution a that 
maximises the posterior. As a varies, the set of solutions a 
is called the maximum-entropy trajectory. We wish to find 
the solution for which £■ is minimised which occurs at the 
end of the trajectory where a = 0. For practical purposes we 
start at a large value of a and iterate downwards until a is 
sufficiently small so that the posterior is dominated by the 
£ term. MemSys performs this algorithm using conjugate 
gradient descent at each step to converge to the maximum- 
entropy trajectory. The required matrix of second deriva- 
tives of jC is approximated using vector routines only, thus 
circumventing the need for 0{N'^) operations required for 
exact calculations. The application of MemSys to the prob- 
lem of network training allows for the fast efficient training 
of relatively large network structures on large data sets that 
would otherwise be difficult to perform in a reasonable time. 
Moreover the MemSys package also computes the Bayesian 
evidence for the model (i.e. network) under consideration, 
(see for example Ijavnes fc BretthorstI |2003| . for a review) , 
which provides a powerful model selection tool. In principle. 



values of the evidence computed for each possible architec- 
ture of the network (and training data) provide a mechanism 
to select the most appropriate architecture, which is simply 
the one that maximises the evidence. 



3 BINNED BISPECTRUM 

Several approaches to bispectrum analyses have been pro- 
posed to reduce the dimensionality of the pro blem without 
losing significant information (see for example iBucher et al.l 
I2OIOI : iFergusson fc Shellardll201ll). In particula r, we use the 
bispectrum estimator defined in iBucher et al.l l|2010l) . The 
proposed method consists of joining the bispectrum com- 
ponents in bins, significantly reducing the computational 
time, but ma i ntaini ng the quality of the estimator of /nl. 
iBucher et all (|2010l ) show that this is the case for ideal 
maps, with isotropic noise and small symmetric masks. 
The binn ed bispectrum is also applied to Planck data in 
IPlanck Co llaboration (20i3) to constrain primordial non- 
Gaussianity. Here we study with more detail its applica- 
tions to realistic data, for which we used simulations with 
WMAP-7yr characteristics. 

As a starting point, the angle- averaged reduced bispec- 
trum is defined by 



Ti-^Tt^Ti^dfl , 



(4) 



where Tt{n) = X]m'^fm^(^)- The binned reduced bispec- 
trum is then 



babc= E E E ^<1*2'?3. 

iiGia izeh taeia 



(5) 



where 7„ are bins in £. This definition of the reduced bispec- 
trum, differing from the star i dard one by the facto r if-^t^e^ 
(for details see IBucher et allboiol : iKomatsul l20o3 ) . IS con- 
venient since one can write babe in terms of Ta, Tj, and Tc 
which are the binned maps: 



(6) 



The advantage of constructing maps in ^-bins is that the 
number of transformations to spherical harmonic space is 
significantly reduced. Then, the resulting bispectrum esti- 
mator is f aster to construct tha n the one based on the KSW 
estimator (|Komatsu et al.ll2005f ) or the SMHW l|Curto et all 

[ml). 



4 Fnl estimators 

The optimal estimator for /nl, in the sense of an unbiased 
estimator that saturates the Cramer-Rao inequality, is ob- 
tained by performing an Edgeworth expansion of the proba- 
bility di s tribut i on of the airn for weak l y non-Gaussian dat a 
l|Bal3ichl I2OO5I : ICremineUi et al. 1 I2OO6I : ISmith et all |2009| ). 
This estimator is found to have a cubic term and a lin- 
ear term in atrn- The latter term plays an important role 
under realistic conditions, where anisotropic instrumental 
noise and/or a mask is present. 

The form of this estimator can also be understood us- 
in g the propertie s of th e Wick product. As d e monst rated 
in IPonzelh et all (|2012l '). iMarinucci fc Peccatil (|201ll ) and 
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iPeccati fc Taggul l|201ll ). the Wick product of a cubic vari- average bispectrum Bi-^ 



M., is; 



able, which is given by 

: Xl,X2,Xi ■— X1X2X3 — Xi {X2X3) — X2 (xiXs) — X3 {X1X2) , 

(7) 

has a smaller variance than the cubic variable itself, while 
not affecting the mean value so long as the variables Xi are 
Gaussian and have a mean value of zero. Then, if we replace 
each cubic term in an estimator by its Wick product, it 
will yield an estimator with lower variance. Following this 
reasoning, the binned bispectrum defined in Sec. [3] can be 
replaced by its Wick product 



-{TiMTi.-iTMTi^ 



(8) 



Note that Ti = Ti(x), since there is a dependence on the 
pixe l for anisotropi c map s. 

iDonzeUi etall have proved that for the case of 

wavelet and needlet coefficients, the linear term is basically 
equivalent to removing the mean value of the coefficients. 
In order to see if this is the case for the binned bispectrum, 
we explore the option of substituting = Tn — {Tn), where 
(T„) is computed with the unmasked pixels. This would be 
less costly than estimating the correlation matrix {TaTb) re- 
quired for the computation of the linear term. 

In the following subsections, we describe three methods 
for choosing the weights that are used to construct the final 
/nl estimator. In each case, estimators are constructed with 
and without the linear term contribution to explore its im- 
portance. Also, the performance of these estimators is tested 
on inpainted and non-inpainted maps, with the methodol- 
ogy explained in Sec. 15.31 In all cases the original mask M is 
applied again at the final stage when computing the binned 
bispectrum components 



E 



47riVnix 



(9) 



where A'^pix — "^^Mi. The efficiency achieved by the esti- 
mators will be compared to that defined by the Cramer-Rao 
inequality. The Cramer-Rao bound states that the minimum 
variance for any unbiased estimator is given by the inverse 
of the Fisher matrix information. A useful reference value 
in the case of partial sky coverage is obtained from the full 
sky estimator corrected by the fraction of the available sky. 
Therefore, the minimum variance for /nl is estimated to be: 



2 



] 



(10) 



where A takes values 1, 2 or 6 when all Is are different, two 
are equal, or all are the same and fsky is the fraction of 
the sky available. For (|10|l to be used for a realistic case, 
the power spectrum must include the noise and the beam 
contribution. The beam also needs to be included in the 
bispectrum part. We have used WMAP-7yr characteristics, 
in particular the average of the two channels of 61 and 94 
GHz (V and W) and the extended mask KQ75. In terms 
of the reduced bispectrum defined in Sec. |3] the angular 



^112(3 



47r 



(2^1 + 1) (2^2 -Fl) (24 + 1) 
1 



X 



(11) 



ei I2 h 
000 



4.1 Approximated maximum-likelihood estimator 
(AMLE) 

The standard approach in this type of analysis is to use 
the fact that the third-order moments are nearly Gaussian, 
and therefore the maximum-likelihood estimator is obtained 
approximately by the minimization of a given by 

(12) 

where ^dejY is the expected value for /nl ~ 1 and 
CtTbi.de/ ^ {babc){hdef) " (babcbdef) ■ From the previous equa- 
tion is straightforward to show that the /nl estimator for 
an observed map is; 



abc,def 



/n 



{babc)''C^^^ 



i.obs 
def"def 



= E 

abc,def E/ i^'^bc) C'aftc.de/ (^de/ ) 



(13) 



abc,def 



In order to include the linear term correction, TaTbTc should 
be substituted by its Wick product (|8}, wherever it appears. 
The expected value of the linear term is zero, and thus it 
vanishes in the term of the estimator related to the model, 
whereas it needs to be included in the covariance matrix. 
Thus, the corresponding estimator is 



{ba 



/nl= Y1 

abc,def /-^ ^ ' 
abc.de f 



/ ^abc,def 



''ibc)^C!abc,d(if{bdef)^ 



(14) 



/ rr-t rri \rTiObs /T"' rri Xrr-iObs /'T"' ryn \ rriOh 

'\J-d,iJ-e,i)J-f,i — \J-d,iJ-f,i)J-e,i ^ {-I- e,i J- f ,i) J- d,'. 



where {babc}^ is estimated using the regression coefficient 
of a linear fit to the mean values of 1,000 simulations with 
different /nl values. For we assume that it is indepen- 
dent of /nl, which is a good approximation in the limit of 
weak non-Gaussianity, and it is thus estimated with Gaus- 
sian simulations (~ 25,000). The term {TaTb} is estimated 
with 1,000 Gaussian simulations. 



4.2 Approximated maximum likelihood estimator 
with diagonal covariance matrix (AMLED) 

The estimator proposed by iBucher et al.l l|2010l ) used the 
approximation of assuming a diagonal covariance matrix. 
In this case, the estimator simplifies significantly, since the 
covariance matrix does not need to be estimated or inverted, 
and one obtains 



/nl 



{babc)^ /v&r{babcKt 

abc '^{{bdefYf /varibdef) 



E 



(15) 



def 
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where var{babc) is the variance of the binned bispectrum 
components, which is computed with simulations. Besides 
its computational efficiency, another advanta ge of this esti- 
mato r is that can be obtained analytically (see lBucher et all 
I2OIOI . for details). 

Strictly speaking, this estimator is optimal only for a 
full-sky CMB experiment with isotropic noise (although it 
has been shown to work well also in presence of a reduced 
symmetric mask). Under realistic conditions, a linear term 
of a similar form to that used above needs to be added, such 
that 



(&aftc)Vvar(6ai,e) 

abc X](<^<''=-f)^)Vvar(6de/) 



( 1 rj-i rp rpOb 



def 



{Ta^iTb^i^Tc.i (Ta^iTc^i')Tlj.i (T\}^iTc,i)Ta,i^ 

As with the previous estimator, 1,000 simulations were 
used for the model estimation and another 1,000 to obtain 
var(&a()c). This implies a reduction by a factor > 10 in the 
number of simulations required with respect to the AMLE. 



4.3 Neural network estimator (NNE) 

The architecture of our 3-layer neural network is defined by 
three parameters: the number of input, output and hidden 
nodes. The first two are determined by the problem at hand; 
in this case the dimension of the input vector depends on 
the number of bins chosen and there is a single output. 

Although the MemSys algorithm provides routines to 
determine the optimal value of the number of hidd en nodes 
using the Bayesian evidence (|Gu11 fc Skillinelll999l) . in this 
application Whid is determined empirically by measuring the 
accuracy of the trained networks on an independent test- 
ing set. In this application, we have found that in fact the 
optimal architecture contains no hidden nodes, resulting in 
what is effectively a linear mapping between input and out- 
put. This is not surprising, since we are effectively 'asking' 
the network to learn the mean value and dispersion of the 
binned bispectrum components for each /nl; since the ex- 
pectation value is linearly dependent on the /nl, this net- 
work architecture trivially satisfies this requirement. Indeed, 
networks of this sort provide a simple way of obtaining the 
(pseudo)inverse of any matrix. 

Then, for zero hidden nodes, the single network output 
is just a linear function of the inputs. Once the network 
parameters (iS, 9) are found during the training process, the 
estimator for /nl is thus given by: 



/nl = Wabcbabc + < 



/nl = E (i^lv" E Ta,.Tb, 



Tc. 



equivalent to a neural network with parameters 
Wdef ^^ E 



abc E/ (^<^bc) C^^^^^^f{bdef} 
abc. def 



(19) 



(20) 



If this were the optimal linear combination to estimate /nl, 
the neural network would find the same result as the AMLE 
but avoiding all the expensive calculations required in the 
direct computation of this estimator (provided that we have 
chosen a linear combination for the NNE). Conversely, if 
that combination were not optimal, the network should be 

(16) 

able to find different, more optimal, weights. For instance, 
for the AMLE to be optimal, the considered statistics should 
follow a Gaussian distribution, whereas the NNE does not 
make any assumptions about the intrinsic distribution of the 
inputs. Therefore, the neural network is expected to perform 
better when working with non-Gaussian statistics. In addi- 
tion, the neural network does not require to assume that the 
covarinace matrix is independent of /nl. Even if this approx- 
imation works well for the current application, it may not 
always be the case, which would significantly complicate the 
calculation of the AMLE. In such cases the NNE would rep- 
resent a clear advantage over the minimization. Finally, 
we would also like to point out that, although for the current 
application a linear combination was found to be the best 
choice for the NNE, in a general case, this estimator is not 
restricted to a linear combination of the inputs, which can 
be useful in other problems. 



5 IMPLEMENTATION 

In this section the non-Gaussian simulations used for the 
analyses as well as some technical details required for the 
implementation of the estimators are described. 



5.1 Non-Gaussian simulations 

Two different sets of non-Gaussian realizations are used. 
A set generated with the map-making method propos ed in 
iFergusson et al.l ([2010.) and described also in Curto et al.l 
( 20 111), and a set of publicly available realisationi|3 generated 
bv lElsner fc Wandeltl (|2009l ). In the first method, the non- 



(17) 



As with the previous estimators the network is also trained 
including the linear term, in which case 



(18) 



Gaussian part of the map {a^Ji) is taken directly from the 
theoretical bispectrum, while the second algorithm starts 
from the primordial curvature fluctuations and is therefore 
more precise. 

The two different sets are used for the following rea- 
sons. Having a large number of independent realizations is 
necessary to train the network, as well as to test its perfor- 
mance with the number of training data. Since the flrst set 
is faster to produce, 30,000 independent realisations were 
generated as in Curto et al. ( 2011). In the analysis with the 
SMHW of ICurto et"al] l|201ll '). they found the dispersion on 
/n l to be sligh t ly lar ger than using the simulations of set 2. 
In lCurto et all l|201ll ). constraints on /nl are obtained with 
both sets finding a discrepancy of 5%. We find similar de- 
viations for the binned bispectrum. This is observed if the 



Comparing with the AMLE estimator, we can see that it is 



http: / / planck. mpa-garching. mpg.de / cmb / fnl-simulations / 
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average bispectrum at the numerator in (|10|) is computed 
with simulations with both sets. Then, as the model of set 1 
is given by an approximation, the minimum dispersion of the 
parameter obtained with realisations is slightly larger than 
using the analytical dispersion in eq. 1101 Conversely, using 
realisations of set 2 we find a closer value to the analytically 
computed lower bound. 

Hence, after proving that the NNE converges with few 
thousand realisations for the best performing form of the 
estimator, the second set is used for the final results. This 
is convenient to be able to compare our results with the 
Fisher dispersion of (ll Op. and with the one s obtained with 
the optimal estimator jKomatsu et aTlbOllh . where simula- 
tions equivalent to the ones of set 2 are used. 

The Gaussian and non-Gaussian harmonic coefficients 
of the CMB realisations, af^ and a^, either generated from 
set 1 or set 2, are combined to obtain the non-Gaussian 
realisation with diff'erent values of /nl: 

G I J. NG rr,^\ 
aim — aim + /NLlim . (.^ij 

Noise-weighted V+W band WM AP-7yr realizations were 
then constructed as ex plained in ICurto et al.1 (|2009l ) and 
[Casaponsa ct al.l| (l201ll l. and the KQ75 mask was then ap- 
plied, which covers roughly 29% of the sky. 



iBertalmio et al.ll2000l '). The goal of these methods is to re- 
store missing or damaged regions of an image to recover the 
original signal as far as possible. For CMB map reconstruc- 
tion, the ideal inpainting method would lead to a restored 
map preserving the statistical properties of the unmasked 
map. 

Different approaches have been used to reduce the dis- 
continuities generated by the mask edges in CMB maps, 
since they introduce undesirable correlations among the 
binned bispectrum components. As the intention here is to 
reduce this impact, rather than reconstruct the full map, we 
use a simple iterative process that averages over the direct 
ne ighbours of th e mask ed pixels, and is based on the work 
of lOliveira erahl l|200j ). 

One begins with the map T{x) and the binary mask 
M{x). Then each pixel of the masked map T' — T x M with 
value zero is substituted by the average of its immediate 
neighbours, whether masked or not, using the HEALPlx 
subroutine neighbours. The process is repeated 1,000 times, 
leaving the masked point sources completely inpainted and 
smoothing the edges of the galactic mask. The results of 
this process are illustrated in Fig. [2l We find that, in this 
case, the technique is more effective than simply using an 
apodized mask. 



5.2 Binning scheme 

One is free to choose the numbe r and size of the bin s in £- 
space for the binned bispectrum. iBucher et al.l (|201Cll ) found 
that for imax =2000 and 64 bins the results obtained were 
99.3% of the optimal value. For an application to WMAP, 
one has imax =1024, so the corresponding number of bins is 
32. We have tested the performance of the estimators with 
different number of bins and find that for ntin = 28 the 
results have converged. Therefore, the following results use 
this number of bins, which also provides a modest saving 
in computation with respect to 32 bins. Conversely to the 
exhaustive choosing of the binning scheme done in Bucher 
et al. (2010) estimator, here we simply use logarithmic bins. 
The logarithmic scale is chosen by imposing the condition 
that all bins have at least one £. 

The binned bispectrum components are computed 
from combinations of three binned maps TaTbTc = 
Efis/„ E<2e/i, 5I^£3g/, r^ir^sTfa. It can be noticed that 
some of the combinations ^ii'2^3 might not satisfy the tri- 
angle condition (£3 — £2 ^ £1 ^ £2 + £3)- To avoid as far 
as possible those undesirable combinations, we discard the 
binned bispectrum components where all the contained £ 
combinations do not meet the triangle condition. For that 
reason the components used are the ones that hold the fol- 
lowing condition: 

timin rmax ^ ijmax ^ fimax . rmax 

where £™™ and are the minimum and maximum value 
of £ of the bin /„. Then, the binned bispectrum for n5i„ = 28 
consists of 1077 components, whereas the full bispectrum 
would have ~ 10* components. 

5.3 Inpainting 

Several inpainting methods have been developed for 
general imaging reconstruction (see e.g. the review by 



5.4 Neural network training process 

To train our /nl network we provide it with an ensemble 
of training data V = {x'"^*'")}. The input vector x'") 
contains the binned bispectrum components, explained in 
Section[3l of the i*'^ simulated CMB map. The output target 
is the corresponding /nl value of the i*'^ CMB simulation. 
Thus, for nun = 28 the input vector has 1077 components, 
and the target vector f'"' for the network consists of only 
one component. From the training set, 20 per cent of the 
realisations are reserved for the validation process. 

The network weights are computed during the training 
procedure, which in this case requires only a few seconds. 
The performance of the network is validated during the 
training process using an independent set of testing data. 
Figure [3] illustrates the training evolution for the regression 
network with rihid = and ridata = 10, 000. In the top panel 
we plot the correlation coefficient between the target and 
the network outputs on the training set and the test set. 
We see that a divergence occurs around 60 iterations of the 
MemSys optimiser due to over-fitting. The same behaviour 
is confirmed if the root mean squared error is studied (bot- 
tom panel). The network parameters use to construct our 
final network estimator in (|17[) and (|18|) are the ones that 
give a maximum value of the correlation coefficient and a 
minimum of the root mean squared error in the validation 
data set. 

It is worth noting that for training the neural network, 
we need to choose a certain range of |/nl| to generate the 
required simulations. We find that [-220 220] is a safe interval 
for training the network, without significantly biasing the 
results for |/nl| up to 30. 
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Figure 2. Inpainting effect shown in the masked WMAP-7yr map. On the top the initial temperature map with the mask in grey and 
an amplified region are presented and on the bottom, the same map and region are given after inpainting. 
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Figure 3. In the top panel the Pearson correlation coefficient 
between true /nl value and the network estimator /nl for case 
3 of table [2] versus the number of iterations. Bottom panel is for 
the root mean squared error of /nl at each iteration. Asterisks 
denote training data and dots denote validation data. 



6 RESULTS 

As a preliminary check, we applied the three estimators to 
Gaussian full-sky maps without noise, finding very similar 
results in all cases (see table[T]). In this ideal case, the AMLE 
should in principle coincide exactly with the AMLED, but 
because of the lack of correlations among the binned bispec- 
trum components the AMLED seems to be more efficient. 
This is probably due to numerical uncertainties that arise 
in the covariance matrix estimation. The neural network is 
found to be nearly as efficient as the AMLED. 

An important difference between the estimators is the 



Estimator 




0"9 


< /nl ><^°"''^ 


AMLED 




9.7 


-0.2 


AMLE 


9.7 


10.3 


-0.3 


NN 




9.8 


-0.2 



Table 1. Results for noiseless full sky maps of set 1. The first 
column is for the estimator used, second column indicates the 
expected dispersion for irnax = 1024 and in the last two columns 
are shown the dispersion and mean value found for 1,000 Gaussian 
maps. 



total number of realisations required to converge, which is 
directly related to the computational efficiency. For the AM- 
LED, a few hundred realisations are sufficient to estimate 
the variance of the binned bispectrum. For the AMLE es- 
timator, however, it is necessary to estimate the covariance 
matrix, which requires at least 25,000 Gaussian simulations. 
For the NNE, a few thousand realisations are required for 
the training process to converge. Nonetheless, it is worth 
noting that the number of training realisations required by 
the NNE does vary with the case being studied. For ex- 
ample, for inpainted maps where neither the linear term is 
taken into account nor the mean is subtracted (case 1 of 
table [5} , the NNE needs 10,000 independent simulations to 
converge. 

In applying the three estimators to realistic simulations, 
based on WMAP-7yr data, larger differences are observed in 
the results; these are summarised in table (2] We find that 
the AMLED estimator reaches values close to the expected 
dispersion if and only if the linear term is subtracted and 
inpainting is performed. Actually, if the estimator is applied 
to non-inpainted maps, the dispersion worsens by a ~ 60%. 
Of course, in the absence of the linear term, the estimator 
becomes highly suboptimal, giving errors of 300%. This is 
not the case for the other two estimators. We notice that the 
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Figure 4. Comparison of the efficiency (top) and bias (bottom) 
of the three estimators with respect to the number of simulations 
used to construct the estimator. For reference, the optimal values 
for the dispersion and bias (dashed black line) are also shown. 
Note that for the NNE, the simulations are used for the training 
process, whereas for the AMLE they are employed to estimate 
the covariance matrix. For the AMLED, they correspond to the 
number of simulations used to obtain the diagonal elements of 
the covariance matrix. 



full covariance matrix estimator and the neural network 
give similar results if instead of taking into account the linear 
term, the mean value of the intermedia te maps is subtracted , 
as is the case for wavelets and needlets (iDonzelli et aljlioisl ) . 
This is observed in both inpainted and non-inpainted maps, 
comparing cases 2 and 3 and 4 and 5 respectively (see ta- 
ble [2]). Indeed, these estimators appear more robust, since 
the improvement due to the inpainting is small. In partic- 
ular, comparing cases 2 and 4, the NNE estimator without 
inpainting increases the dispersion only by 5% and for the 
full estimator by ~ 10%, while for the AMLED the results 
are much worse. Although similar results are found with the 
AMLE and the NNE estimators, one important difference is 
the number of simulations required to construct them. As 
commented before, 25,000 Gaussian realizations were used 
to estimate the covariance matrix in AMLE. As shown in 
top panel of Fig. |4l the NNE requires dramatically fewer 
training realisations and also has the advantage that the av- 
erage value of the binned bispectrum at /nl = 1 does not 



S 



data 
■ y=x 
- linear fit 




-2 2 

NN weigfits 



x 10 



Figure 5. Weights for the AMLE estimator involving the covari- 
ance matrix and the model, versus the NN weights obtained after 
the training process. This comparison is made when both estima- 
tors have converged presenting a linear fit slope and intercept of 
a = 91,6 = 2 X 10^. 



need to be estimated. In the same figure, bottom panel, we 
plot the bias found for the /nl estimates for 1,000 Gaus- 
sian realisations for the three estimators with the number 
of simulations used. One sees that the AMLE requires more 
realizations than the other two estimators to produce unbi- 
ased results. 

All these results indicate that the neural network is a 
viable short cut to obtaining the necessary weights to con- 
struct the AMLE estimator. In Fig. [5] the weights found for 
the neural network are compared to those of the AMLE. 
Note that the weights of both estimators are very similar, 
validating the relation stated in (|19[) . The contribution of 
the network parameter 6 is negligible for all cases. 

In terms of computational demand, the most efficient 
estimators are the NNE and the AMLED, with the number 
of simulations required at least 10 times smaller than for the 
AMLE. Note that for the AMLED we have used realisations 
to estimate the average of the bispectrum at /nl = 1, there- 
fore the final number of realisations employed is similar to 
the ones used for training the NNE. 

For all three estimators, the best results are obtained 
when the map is inpainted and the linear term is subtracted 
(see case 3 of tabled indicated in bold face). For this opti- 
mal case, we compute {hg bcY with 1,000 simulations of set 2 
sner fc WandeltllioogI ). to compare it with the expected 
dispersion for a WMAP-7yr characteristics, computed as in 
pOjl . The neural network is now trained with this set of airn. ■ 
As we have seen, the NNE typically requires 2,500 indepen- 
dent training realisations to converge. Since only 1,000 are 
available, we therefore generated 10,000 simulations using 
the same set of aim rotating them and addin g different noise 
contributions. This procedure was used in iCasaponsa et al.l 
l|201ll ) and was found to be useful when only a small number 
of realisations is available. 

In table [3] the final results for all of the estimators are 
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No 
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AMLED 


31.5 


0.7 


40 


AMLE 


24.0 


0.7 


6.7 


NN 


23.1 


0.5 


2.7 


4 


No 


Yes 


No 


AMLED 


35.9 


-0.3 


60 


AMLE 


24.3 


0.1 


9.3 


NN 


23.6 


0.6 


4.8 


5 


No 


No 


Yes 


AMLED 


37.0 


1.5 


64 


AMLE 


24.6 


-0.4 


8.0 


NN 


23.6 


0.4 


4.8 



Table 2. Comparison of results depending on the estimator. The columns are the characteristic of the estimator, if an inpainting of the 
simulations is made, if the linear term is added and if the mean was subtracted on the binned intermediate maps. Next columns are 
"■(/jvi) and (/nl) for 1,000 Gaussian simulations. Finally the relative error related to the minimum expected dispersion is shown in the 
last column. 



Estimator 






< /nl 


— Map 

/nl 


A/nl 


AMLED 




21.7 


-0.2 


33.4 


3±2 


AMLE 


21.3 


22.4 


-0.1 


39.8 


3±2 


NN 




21.4 


0.5 


44.2 


4±2 



Table 3. Results for inpainted Gaussian realizations. Model estimated and neural network trained using Eisner & Wandelt simulations 
(set 2). The columns from left to right are: the estimator used, the Fisher a computed from eg. 1101 the dispersion and mean value of /nl 
for 1,000 Gaussian simulations. Followed by the /nl value found for WMAP-7yr data and the contribution expected by the unresolved 
point sources (A/nl)- 



shown. The values for WMAP-7yr data are without point 
sources correction, which is given in the last column of the 
same table. The unresolved point sources co ntribution to 
. fNL is obt ained using the same pr ocedure as in lCurto et al.l 
l|2009t ) and lCasaponsa et al.l |20l3). As expected, by looking 
at the preliminary results, the tightest constraints are given 
by the NNE and AMLED estimators. For comparison, the 
WMAP-7yr map /nl estirnate wi th the optimal estimator 
obtained bv lKomatsu et al] l|201ll ) is 42, without the point 
sources correction. Note that the closest value is given by 
the NNE. The constraints for /nl with the point source 
contribution taken into account at 95% confidence level are 
—3 < /nl < 83 to be compare with —2 < /nl < 82 given 
by the optimal estimator. 



7 CONCLUSIONS 

We have trained a regression network with the binned bis- 
pectrum components of non- Gaussian realizations in order 
to obtain constraints on the local non-linear coupling param- 
eter /nl. We have compared the results with those obtained 
with a maximum-likelihood estimator, using either a diago- 
nal or a full covariance matrix. We also studied the effect of 
the addition of the linear term, mean subtraction and the 
use of inpainting. 

We find that the three estimators become close to op- 
timal if the linear term is subtracted and inpainting is per- 
formed. We find that the linear term is absolutely necessary 
if a diagonal covariance matrix is used. However, its efTect is 



very small if the full covariance matrix or the neural network 
is used and the mean is subtracte d from the binned map s, as 
found for wavelets and needlets in lDonzelli et al.l (|2012l ) and 
ICurto et all (|2012D . In that sense, the choice of the estima- 
tor depends on the difficulty of computing the linear term. 
Although the best results for all estimators are obtained 
when inpainted maps are used, the largest effect of this tech- 
nique is seen in the AMLED estimator, with the other two 
being less affected by the presence of a mask. Thus, the 
most robust tools are the AMLE and the NNE estimators, 
with the NNE displaying a clear computational advantage, 
since the covariance matrix does not need to be estimated 
or inverted; this reduces significantly the number of simu- 
lations required. Another advantage of the neural network 
estimator arises from the fact that for minimization the 
dependence of the covariance matrix on /nl makes a full 
solution computationally hard, if not unfeasible, for certain 
problems. Conversely, the NNE bypasses such calculations, 
thereby simplifying the analysis. 

We conclude that the most efficient tools are the neural 
network regression estimator and the AMLED estimator. 
The latter would be the choice if a small set of non-Gaussian 
simulations is available (~1,000), or analytical models are 
preferred. However, the AMLED depends on a specific pre- 
processing of the data. Neural networks give almost optimal 
results, without the use of inpainting, thereby avoiding the 
need to alter the data. 

Finally, the constraints for WMAP-7yr data, with the 
unresolved point sources contribution included, at 95% con- 
fidence level would be —3 < /nl < 83. These results are 
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compatible with / n l =0, as found in Komatsu et alj l|201lh ; 
ICurto et all (|201ll ) ; iBennett et all (|2012l ). Note that we have 
used foreground reduced maps, and the foregrounds have not 
been marginalised over in this analysis. 

We note that neural networks would be a useful method 
to estimate jointly other forms of non-Gaussianity, such as 
those where the number of outputs were set to a number of 
different /nl shapes (e.g. local, equilateral, orthogonal), but 
this is left for future work. 
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